arXiv:1508.04389v1 [cs.CV] 18 Aug 2015


A Deep Pyramid Deformable Part Model for Face Detection 


Rajeev Ranjan, Vishal M. Patel, Rama Chellappa 
Center for Automation Research 
University of Maryland, College Park, MD 20742 

{rranjan1, pvishalm, rama}@umiacs.umd.edu


Abstract 

We present a face detection algorithm based on Deformable Part Models and deep pyramidal features. The proposed method, called DP2MFD, is able to detect faces of various sizes and poses in unconstrained conditions. It reduces the gap in training and testing of DPM on deep features by adding a normalization layer to the deep convolutional neural network (CNN). Extensive experiments on four publicly available unconstrained face detection datasets show that our method is able to capture the meaningful structure of faces and performs significantly better than many competitive face detection algorithms.

1. Introduction 

Face detection is a challenging problem that has been actively researched for over two decades [37], [36]. Current methods work well on images that are captured under user controlled conditions. However, their performance degrades significantly on images that have cluttered backgrounds and large variations in face viewpoint, expression, skin color, occlusions and cosmetics.

The seminal work of Viola and Jones [32] has made face detection feasible in real-world applications. They use cascaded classifiers on Haar-like features to detect faces. The cascade structure has been a subject of extensive research since then. Cascade detectors work well on frontal faces; however, they sometimes fail to detect profile or partially occluded faces. A recently developed joint cascade-based method [?] yields improved detection performance by incorporating a face alignment step in the cascade structure. Headhunter [25] uses rigid templates along similar lines. The method based on Aggregate Channel Features (ACF) [34] deploys a cascade of channel features, while Pixel Intensity Comparisons Organized (Pico) [24] uses a cascade of rejectors for improved face detection.

Most of the recent face detectors are based on the Deformable Parts Model (DPM) structure [?], where a face is defined as a collection of parts. These parts are trained side-by-side with the face using a spring-like constraint. They are fine-tuned to work efficiently with the HOG [?] features. A unified approach for face detection, pose estimation and landmark localization using the DPM framework was recently proposed in [38]. This approach defined a “part” at each facial landmark and used a mixture of tree-structured models resilient to viewpoint changes. A properly trained simple DPM is shown to yield significant improvement for face detection in [25].

The key challenge in unconstrained face detection is that features like Haar wavelets and HOG do not capture the salient facial information at different poses and illumination conditions. The limitation is more due to the features used than the classifiers. However, with recent advances in deep learning techniques and the availability of GPUs, it is becoming possible to use deep Convolutional Neural Networks (CNN) for feature extraction. It has been shown in [?] that a deep CNN pretrained with a large generic dataset such as ImageNet [?] can be used as a meaningful feature extractor. The deep features thus obtained have been used extensively for object detection. For instance, Regions with CNN (R-CNN) [7] computes region-based deep features and attains state-of-the-art results on the ImageNet challenge. Methods like OverFeat [28] and DenseNet [?] adopt a sliding window approach to detect objects from the pool5 features. Deep Pyramid [8] and Spatial Pyramid [9] remove the fixed-scale input dependency from deep CNNs, which makes them attractive to be integrated with DPMs. Although a lot of research on deep learning has focused on object detection and classification, very few works have used deep features for face detection, which is equally challenging because of high variations in pose, ethnicity, occlusions, etc. It was shown in [?] that deep CNN features fine-tuned on faces are informative enough for face detection, and hence do not require an SVM classifier. They detect faces based on the heat map score obtained directly from the fifth convolutional layer. Although they report competitive results, detection performance for faces of various sizes and occlusions needs improvement.

Figure 1. Overview of our approach. (1) An image pyramid is built from a color input image with level 1 being the lowest size. (2) Each pyramid level is forward propagated through a deep pyramid CNN [8] that ends at the max variant of convolutional layer 5 (max5). (3) The result is a pyramid of max5 feature maps, each at 1/16th the spatial resolution of its corresponding image pyramid level. (4) Each max5 level is normalized using z-score to form the norm5 feature pyramid. (5) Each norm5 feature level is convolved with every root-filter of a C-component DPM to generate a pyramid of DPM scores (6). The detector outputs a bounding box for the face location (7) in the image after non-maximum suppression and bounding box regression.

In this paper, we propose a face detector which detects faces at multiple scales, poses and occlusions by efficiently integrating deep pyramid features [8] with DPMs. This paper makes the following contributions:

1. We propose a novel method for training DPM for faces using deep pyramidal features.

2. We propose adding a normalization layer to the deep CNN to reduce the bias in face sizes.

3. We achieve new state-of-the-art detection performances on four challenging face detection datasets.

This paper is organized as follows. Section 2 describes our proposed face detector in detail. Section 3 provides the detection results on four challenging datasets. Finally, Section 4 concludes the paper with a brief summary and discussion.

2. Face Detection with Deep Pyramid DPM 

Our proposed face detector, called Deep Pyramid Deformable Parts Model for Face Detection (DP2MFD), consists of two modules. The first one generates a seven-level normalized deep feature pyramid for any input image of arbitrary size. Fixed-length features from each location in the pyramid are extracted using the sliding window approach. The second module is a linear SVM which takes these features as input to classify each location as face or non-face, based on their scores. In this section, we provide the design details of our face detector and describe its training and testing processes.

2.1. DPM Compatible Deep Feature Pyramid 

We build our model using the feature pyramid network implementation provided in [?]. It takes an input image of variable size and constructs an image pyramid with seven levels. Each level is embedded in the upper left corner of a large (1713 × 1713 pixels) image and maintains a fixed scale factor with its next lower level in the hierarchy. Using this image pyramid, the network generates a pyramid of 256 feature maps at the fifth convolution layer (conv5). A 3 × 3 max filter is applied to the feature pyramid at a stride of one to obtain the max5 layer, which essentially incorporates the conv5 “parts” information. Hence, it suffices to train a root-only DPM on the max5 feature maps without explicitly training on DPM parts. A cell at location (j, k) in the max5 layer corresponds to the pixel (16j, 16k) in the input image, with a highly overlapping receptive field of size 163 × 163 pixels. Despite having a large receptive field, the features are localized well enough to be effective for sliding window detectors.
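As an illustration of the max5 construction and the cell-to-pixel correspondence described above, the following NumPy sketch applies a 3 × 3 max filter at stride one and maps a max5 cell back to the image. The edge padding, the function names, and the assumption that the 163 × 163 receptive field is centred on pixel (16j, 16k) are choices of this sketch and are not specified in the text.

```python
import numpy as np

STRIDE = 16            # subsampling between image pixels and max5 cells (from the text)
RECEPTIVE_FIELD = 163  # receptive-field side in pixels (from the text)


def max_filter_3x3(conv5):
    """Apply a 3x3 max filter with stride 1 to an (H, W, C) feature map.

    Edge padding keeps the output the same size as the input; this
    padding choice is an assumption of the sketch.
    """
    h, w, _ = conv5.shape
    padded = np.pad(conv5, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Stack the nine shifted views and take the element-wise maximum.
    shifted = [padded[dy:dy + h, dx:dx + w, :]
               for dy in range(3) for dx in range(3)]
    return np.max(np.stack(shifted, axis=0), axis=0)


def cell_to_receptive_field(j, k):
    """Return inclusive pixel bounds (j_min, k_min, j_max, k_max).

    Assumes the receptive field is centred on pixel (16j, 16k).
    """
    cj, ck = STRIDE * j, STRIDE * k
    half = RECEPTIVE_FIELD // 2  # 81 pixels on each side of the centre
    return (cj - half, ck - half, cj + half, ck + half)
```

Because neighbouring cells are only 16 pixels apart while each one sees 163 pixels, adjacent receptive fields overlap heavily, which is the property the paragraph above refers to.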

It has been suggested in [?] that deep feature pyramids can be used as a replacement for the HOG pyramid in DPM implementations. However, this is not entirely obvious, as deep features differ from HOG features in many aspects. Firstly, the deep features from the max5 layer have a receptive field of size 163 × 163 pixels, unlike HOG, where the receptive region is localized to a bin of 8 × 8 pixels. As a result, max5 features at face locations in the test images would be substantially different from those of a cropped face. This prohibits us from using the deep features of cropped faces as positive training samples, which is usually the first step in training a HOG-based DPM. Hence, we take a different approach of collecting positive and negative training samples from the deep feature pyramid itself. This procedure is described in detail in subsection 2.3.

Figure 2. Comparison between HOG, max5 and norm5 feature pyramids. In contrast to max5 features, which are scale selective, norm5 features have almost uniform activation intensities across all the levels.

Secondly, the deep pyramid features lack the normalization attribute associated with HOG. The feature activations vary widely in magnitude across the seven pyramid levels, as shown in Figure 2. Typically, the activation magnitude for a face region decreases with the size of the pyramid level. As a result, a large face detected by a fixed-size sliding window at a lower pyramid level will have a high detection score compared to a small face detected at a higher pyramid level. In order to reduce this bias to face size, we apply a z-score normalization step on the max5 features at each level. For a 256-dimensional feature vector $x_{i,j,k}$ at the pyramid level $i$ and location $(j, k)$, the normalized feature $\hat{x}_{i,j,k}$ is computed as:

$$\hat{x}_{i,j,k} = \frac{x_{i,j,k} - \mu_i}{\sigma_i}, \qquad (1)$$

where $\mu_i$ is the mean feature vector, and $\sigma_i$ is the standard deviation for the pyramid level $i$. We refer to the normalized max5 features as “norm5”. A root-only DPM is trained on the norm5 feature pyramid using a linear SVM. Figure 1 shows the complete overview of our model.
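The z-score normalization described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes the mean feature vector is averaged over all spatial locations of a level, that the standard deviation is a single scalar computed over the whole level, and it adds a small `eps` (hypothetical) to avoid division by zero.

```python
import numpy as np


def normalize_level(max5_level, eps=1e-8):
    """Z-score normalize one pyramid level of shape (H, W, 256).

    Assumptions of this sketch: the mean is a 256-dimensional vector
    taken over all (j, k) locations; the standard deviation is one
    scalar computed over the mean-centred level.
    """
    mu = max5_level.mean(axis=(0, 1), keepdims=True)  # shape (1, 1, 256)
    centered = max5_level - mu
    sigma = centered.std()                            # scalar for the level
    return centered / (sigma + eps)


def normalize_pyramid(max5_pyramid):
    """Normalize every level independently to obtain the norm5 pyramid."""
    return [normalize_level(level) for level in max5_pyramid]
```

Since each level is rescaled by its own statistics, multiplying all activations of a level by a constant leaves the normalized features unchanged, which is what removes the scale-dependent magnitude differences between levels.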

2.2. Testing 

At test time, each image is fed to the model described above to obtain the norm5 feature pyramid. Its levels are convolved with the fixed-size root-filters for each component of the DPM in a sliding window fashion to generate a detection score at every location of the pyramid. Locations having scores above a certain threshold are mapped to their corresponding regions in the image. These regions undergo greedy non-maximum suppression to prune low-scoring detection regions with Intersection-Over-Union (IOU) overlap above 0.3. In order to localize the face as accurately as possible, the selected boxes undergo bounding box regression.
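The greedy non-maximum suppression step can be sketched in plain Python as follows. The box format (x1, y1, x2, y2) and the function names are assumptions of this sketch; only the 0.3 IOU threshold comes from the text.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.3):
    """Return indices of the kept boxes, highest score first.

    A box is dropped when it overlaps an already kept, higher-scoring
    box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, of two heavily overlapping boxes only the higher-scoring one survives, while a distant third box is kept regardless of its lower score.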


Owing to the subsampling factor of 16 between the input image and the norm5 layer, the total number of sliding windows amounts to approximately 25k, compared to approximately 250k for the HOG pyramid, which reduces the effective test time.
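The sliding-window scoring used at test time can be sketched as a dense correlation between one feature level and one root filter. The explicit loops, the `bias` term and the function names are illustrative assumptions; an actual implementation would use an optimized convolution routine.

```python
import numpy as np


def score_level(norm5_level, root_filter, bias=0.0):
    """Slide an (h, w, C) root filter over an (H, W, C) feature level.

    Returns an (H - h + 1, W - w + 1) map whose entry (y, x) is the
    linear score of the window with top-left cell (y, x).
    """
    H, W, _ = norm5_level.shape
    h, w, _ = root_filter.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            window = norm5_level[y:y + h, x:x + w, :]
            scores[y, x] = np.sum(window * root_filter) + bias
    return scores


def detections_above(scores, threshold):
    """Return (row, col, score) for every location scoring above threshold."""
    rows, cols = np.nonzero(scores > threshold)
    return [(int(r), int(c), float(scores[r, c])) for r, c in zip(rows, cols)]
```

Each score map has one entry per window position, so the number of windows per level shrinks with the square of the subsampling factor between image and feature map.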

2.3. Training 

For training, both positive and negative samples are drawn directly from the norm5 feature pyramid. The dimensions of the root filters for the DPM are decided by the aspect ratio distribution of faces in the dataset. The root-filter sizes are scaled down by a factor of 8 to match the face size in the feature pyramid. Since a given training face maps its bounding box to each pyramid level, we choose the optimal level $l$ for the corresponding positive sample by minimizing the sum of absolute differences between the dimensions of the bounding box and the root filter at each level. For a root-filter of dimension $(h, w)$ and bounding box dimension of $(b_i^h, b_i^w)$ for the pyramid level $i$, $l$ is given by

$$l = \arg\min_i \; |b_i^h - h| + |b_i^w - w|. \qquad (2)$$

The ground truth bounding box at level $l$ is then resized to fit the DPM root-filter dimensions. We finally extract the norm5 feature of dimension $h \times w \times 256$ from the shifted ground truth position in level $l$ as a positive sample for training.
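The level-selection rule above amounts to a one-line minimization. The sketch below assumes the ground-truth box dimensions have already been projected to every pyramid level and are expressed in feature-map cells; the function name and the zero-based level index are choices of this sketch.

```python
def select_level(box_dims_per_level, filter_dims):
    """Pick the pyramid level whose projected box best matches the root filter.

    box_dims_per_level: list of (height, width) of the ground-truth box
        at each pyramid level, in feature-map cells.
    filter_dims: (h, w) of the root filter, in feature-map cells.
    Returns the zero-based index of the level minimizing the sum of
    absolute differences; ties resolve to the lowest index.
    """
    h, w = filter_dims
    costs = [abs(bh - h) + abs(bw - w) for bh, bw in box_dims_per_level]
    return min(range(len(costs)), key=costs.__getitem__)
```

For a 5 × 5 root filter and projected box sizes (2, 2), (3, 3), (5, 4), (7, 6), (10, 8), the costs are 6, 4, 1, 3 and 8, so the third level (index 2) is selected.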

The negative samples are collected by randomly choosing root-filter sized boxes from the normalized feature pyramid. Only those boxes having IOU less than 0.3 with the ground truth face at the particular level are considered as negative samples for training.
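Negative sampling as described above can be sketched with rejection sampling. The seed, the retry cap `max_tries`, and the box format (x1, y1, x2, y2) in feature-map cells are hypothetical details introduced for illustration; only the IOU < 0.3 criterion comes from the text.

```python
import random


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def sample_negatives(level_shape, filter_dims, gt_boxes, count,
                     iou_max=0.3, seed=0, max_tries=10000):
    """Draw root-filter-sized boxes whose IOU with every ground truth is < iou_max."""
    H, W = level_shape
    h, w = filter_dims
    rng = random.Random(seed)
    negatives = []
    for _ in range(max_tries):
        if len(negatives) == count:
            break
        y = rng.randint(0, H - h)  # randint is inclusive on both ends
        x = rng.randint(0, W - w)
        box = (x, y, x + w, y + h)
        if all(iou(box, gt) < iou_max for gt in gt_boxes):
            negatives.append(box)
    return negatives
```

The feature vector of each accepted box would then be cropped from the norm5 level and labeled as non-face.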

Once the training features are extracted, we optimize a linear SVM for each component of the root-only DPM. Since the training data is too large to fit in memory, we adopt the standard hard negative mining method [?], [?] to train the SVM. We also train a bounding box regressor to localize the detected face accurately. The procedure is similar to the bounding box regression used in R-CNN [7], the only difference being that our bounding box regressor is trained on the norm5 features.
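For readers unfamiliar with R-CNN-style bounding box regression, the sketch below shows the target parameterization used there: center offsets scaled by the proposal size, and log-ratios of width and height. Whether the authors use exactly these targets is not stated in the text, so this is an assumption; the regressor itself (which maps features to these targets) is omitted.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    w, h = box[2] - box[0], box[3] - box[1]
    return (box[0] + w / 2.0, box[1] + h / 2.0, w, h)


def regression_targets(proposal, ground_truth):
    """Targets (tx, ty, tw, th) that map a proposal onto its ground truth."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def apply_regression(proposal, t):
    """Apply predicted offsets t = (tx, ty, tw, th) to a proposal box."""
    px, py, pw, ph = to_center(proposal)
    cx, cy = px + pw * t[0], py + ph * t[1]
    w, h = pw * math.exp(t[2]), ph * math.exp(t[3])
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

Applying the targets computed for a proposal and its ground truth recovers the ground-truth box, which is a quick sanity check for any implementation of this parameterization.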

3. Experimental Results 

We evaluated the proposed deep pyramid DPM face detection method on four challenging face detection datasets: Annotated Face in-the-Wild (AFW) [38], Face Detection Dataset and Benchmark (FDDB) [?], Multi-Attribute Labelled Faces (MALF) [35] and the IARPA Janus Benchmark A (IJB-A) [?], [?] dataset. We train our detector on the FDDB images using Caffe [13] for both 1-component (DP2MFD-1c) and 2-component (DP2MFD-2c) DPM. The FDDB dataset was evaluated using the 10-fold cross-validation approach. For evaluating the AFW and the MALF datasets, images from all the 10 splits of the FDDB dataset were used as training samples.

3.1. AFW Dataset Results 

The AFW dataset [38] contains 205 images with 468 faces collected from Flickr. Images in this dataset contain cluttered backgrounds with large variations in both face viewpoint and appearance.



Figure 3. Performance evaluation on the AFW dataset. 

The precision-recall curves¹ of different academic as well as commercial methods on the AFW dataset are shown in Figure 3. Some of the academic face detection methods compared in Figure 3 include OpenCV implementations of the 2-view Viola-Jones algorithm, DPM [6], mixture of trees (Zhu et al.) [38], boosted multi-view face detector (Kalal et al.) [?], boosted exemplar [20] and the joint cascade method [?]. As can be seen from this figure, our method outperforms most of the academic detectors and performs comparably to a recently introduced joint cascade-based method [?] and the best commercial face detector, Google Picasa. Note that the joint cascade-based method [?] uses face alignment to make the detection better and trains the model on 20,000 images. In contrast, we do not use any alignment procedure in our detection algorithm and train on only 2,500 images.

¹ The results of the methods other than our DP2MFD methods compared in Figure 3 were provided by the authors of [?], [?] and [20].

3.2. FDDB Dataset Results 

The FDDB dataset [?] is the most widely used benchmark for unconstrained face detection. It consists of 2,845 images containing a total of 5,171 faces collected from news articles on the Yahoo website. The faces in all images were manually localized to generate the ground truth. The FDDB dataset has two evaluation protocols, discrete and continuous, which essentially correspond to coarse match and precise match between the detection and the ground truth, respectively.
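For reference, the two protocols can be expressed as scoring functions of the overlap between a detection and its matched ground truth. The sketch below follows the FDDB benchmark definition as it is commonly described (a discrete score of 1 when the IOU exceeds 0.5, and a continuous score equal to the IOU itself); the 0.5 threshold is not stated in this paper and should be treated as an assumption.

```python
def discrete_score(iou_value, threshold=0.5):
    """Coarse match: full credit once the overlap passes the threshold.

    The default threshold of 0.5 is an assumption taken from the usual
    description of the FDDB discrete protocol.
    """
    return 1.0 if iou_value > threshold else 0.0


def continuous_score(iou_value):
    """Precise match: the credit equals the overlap ratio itself."""
    return iou_value
```

A detection with an IOU of 0.6 therefore earns 1.0 under the discrete protocol but only 0.6 under the continuous one, so the continuous protocol is far more sensitive to how tightly the box fits the annotation.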

Figure 4 compares the performance of different academic and commercial detectors using the Receiver Operating Characteristic (ROC) curves on this dataset. The academic algorithms compared in Figure 4(a)-(b) include Yan et al. [33], boosted exemplar [20], SURF frontal and multi-view [?], PEP adapt [?], XZJY [29], Zhu et al. [38], Segui et al. [27], Koestinger et al. [?], Li et al. [?], Jain et al. [?], Subburaman et al. [?], Viola-Jones [32], Mikolajczyk et al. [26] and Kienzle et al. [?]. The commercial algorithms compared in Figure 4(c)-(d) include Face++, the Olaworks face detector, the IlluxTech frontal face detector and the Shenzhen University face detector.²

As can be seen from this figure, our method significantly outperforms all previous academic and commercial detectors under the discrete protocol and performs comparably to the previous state-of-the-art detectors under the continuous protocol. The decrease in performance for the continuous case is mainly because of the low IOU score obtained when matching our detector's rectangular bounding box with the elliptical ground truth mask of the FDDB dataset.
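A short calculation illustrates why a rectangular box is penalized against an elliptical ground truth. Assume, for illustration only, an axis-aligned ellipse $E$ with semi-axes $a$ and $b$, and a detection $R$ that is exactly its tight bounding rectangle. Since $E \subset R$, the intersection is the ellipse and the union is the rectangle:

```latex
\begin{align*}
|E| &= \pi a b, \qquad |R| = (2a)(2b) = 4ab, \\
\mathrm{IOU}(E, R) &= \frac{|E \cap R|}{|E \cup R|}
                    = \frac{\pi a b}{4 a b}
                    = \frac{\pi}{4} \approx 0.785 .
\end{align*}
```

Under these assumptions, even a perfectly placed rectangle reaches an overlap of only about 0.785, which lowers a score that is proportional to IOU while leaving a coarse, thresholded match unaffected.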

We also implemented an R-CNN method for face detection and evaluated it on the FDDB dataset. The R-CNN method selects face-independent candidate regions from the input image and computes a 4096-dimensional fc7 feature vector for each of them. An SVM trained on fc7 features classifies each region as face or non-face based on the detection score. The method, represented by “RCNN-face”, performs better than most of the academic face detectors [?], [19]. This shows the dominance of deep CNN features over HOG and SURF. However, RCNN-face's performance is inferior to the DP2MFD method, as the region selection process might miss a face in the image.

² http://vis-www.cs.umass.edu/fddb/results.html

Figure 4. Performance evaluation on the FDDB dataset. (a) and (b) compare our method with previously published methods under the discrete and continuous protocols, respectively. Similarly, (c) and (d) compare our method with commercial systems under the discrete and continuous protocols, respectively.

3.3. MALF Dataset Results 

The MALF dataset [35] consists of 5,250 high-resolution images containing a total of 11,931 faces. The images were collected from Flickr and the image search service provided by Baidu Inc. The average image size in this dataset is 573 × 638. On average, each image contains 2.27 faces: 46.97% of the images contain one face, 43.41% contain 2 to 4 faces, 8.30% contain 5 to 9 faces and 1.31% contain more than 10 faces. Since this dataset comes with multiple annotated facial attributes, evaluations on attribute-specific subsets are proposed. Different subsets are defined corresponding to different combinations of attribute labels. In particular, the ‘easy’ subset contains faces larger than 60 × 60 in size without any large pose, occlusion or exaggerated expression variations, and the ‘hard’ subset contains faces larger than 60 × 60 in size with one of extreme pose, expression or occlusion variations. Furthermore, scale-specific evaluations are also proposed, in which algorithms are evaluated on two subsets, ‘small’ and ‘large’. The ‘small’ subset contains faces that are smaller than 60 × 60 and the ‘large’ subset contains faces that are larger than 90 × 90.

The performance of different algorithms, both from academia and industry, is compared in Figure 5 by plotting the True Positive Rate vs. False Positives Per Image curves.³ Some of the academic methods compared in Figure 5 include ACF [34], DPM [25], Exemplar method [20], Headhunter [25], TSM [38], Pico [24], NPD [?] and W. S. Boost [?]. From Figure 5(a), we see that overall the performance of our DP2MFD method is the best among the academic algorithms and is comparable to the best commercial algorithms, FacePP-v2 and Picasa.

In the ‘small’ subset, denoted by < 30 height in Figure 5(b), the performance of all algorithms drops a little, but our DP2MFD method still performs the best among the academic methods. On the ‘large’, ‘easy’, and ‘hard’ subsets, the DPM method [25] performs the best and our DP2MFD method performs the second best, as shown in Figure 5(c), (d) and (e), respectively. The DPM and Headhunter [25] are better as they train multiple models to fully capture faces in all orientations, apart from training on more than 20,000 samples.

We provide the results of our method for the IOU of 0.35 as well as 0.5 in Figure 5. Since the non-maximum suppression ensures that no two detections can have IOU > 0.3, the decrease in performance for IOU of 0.5 is mainly due to improper bounding box localization. One of the contributing factors might be the localization limitation of CNNs due to the high amount of sub-sampling. In future, we plan to analyze this issue in detail.

³ The results of the methods other than our DP2MFD methods compared in Figure 5 were provided by the authors of [35].

Figure 5. Fine-grained performance evaluation on the MALF dataset: (a) on the whole test set, (b) on the small faces sub-set, (c) on the large faces sub-set, (d) on the ‘easy’ faces sub-set and (e) on the ‘hard’ faces sub-set.

3.4. IJB-A Dataset Results

The IJB-A dataset contains images and videos from 500 subjects collected from online media [?], [?]. In total, there are 67,183 faces, of which 13,741 are from images and the remaining are from videos. The locations of all faces in the IJB-A dataset were manually ground truthed by human annotators. The subjects were captured so that the dataset contains wide geographic distribution. All face bounding boxes are about 36 pixels or larger.

Nine different face detection algorithms were evaluated on this dataset in [?]. Some of the algorithms compared in [?] include one commercial off the shelf (COTS) algorithm, three government off the shelf (GOTS) algorithms, two open source face detection algorithms (OpenCV's Viola-Jones and the detector provided in the Dlib library), and PittPatt ver 4 and 5. In Figure 6(a) and (b) we show the precision vs. recall curves and the ROC curves, respectively, corresponding to our method and one of the best reported methods in [?]. As can be seen from this figure, our method outperforms the best performing method reported in [?] by a large margin.

3.5. Discussion 

It is clear from these results that our DP2MFD-2c method performs only slightly better than the DP2MFD-1c method. This can be attributed to the fact that the aspect ratio of a face does not change much with pose. Figure 7 shows several detection results on the four datasets. It can be seen from this figure that our method is able to detect profile faces as well as faces of different sizes in images with cluttered backgrounds.

3.6. Runtime 

Our face detector was tested on a machine with 4 cores, 12GB RAM, and 1.6GHz processing speed. No GPU was used for processing. The model DP2MFD-1c took about 24.5s on average to evaluate a face, whereas DP2MFD-2c took about 26s. The deep pyramid feature evaluation took around 23s. It can certainly be reduced to 0.5s [?] by using a Tesla K20 GPU for feature extraction.

4. Conclusions 

In this paper, we presented a method for unconstrained face detection which essentially trains a DPM for faces on a deep feature pyramid. One of the interesting features of our algorithm is that we add a normalization layer to the deep CNN, which reduces the bias in face sizes. Extensive experiments on four publicly available unconstrained face detection datasets demonstrate the effectiveness of our proposed approach.

Our future work will include a GPU implementation of our method for reducing the computing time. We will also evaluate the performance of our method on other object detection datasets.

Figure 6. Performance evaluation on the IJB-A dataset. (a) Precision vs. recall curves, (b) ROC curves.